Fast search for Dirichlet process mixture models 
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Abstract 

Dirichlet process (DP) mixture models pro- 
vide a flexible Bayesian framework for den- 
sity estimation. Unfortunately, their flexibil- 
ity comes at a cost: inference in DP mixture 
models is computationally expensive, even 
when conjugate distributions are used. In the 
common case when one seeks only a maxi- 
mum a posteriori assignment of data points 
to clusters, we show that search algorithms 
provide a practical alternative to expensive 
MCMC and variational techniques. When a 
true posterior sample is desired, the solution 
found by search can serve as a good initializer 
for MCMC. Experimental results show that 
using these techniques is it possible to apply 
DP mixture models to very large data sets. 



1 INTRODUCTION 

Dirichlet process (DP) mixture models provide a flexi- 
ble Bayesian solution to nonparametric density estima- 
tion. Their flexibility derives from the fact that one 
need not specify a number of mixture components a 
priori. In practice, DP mixture models have been used 
for problems in genomics (Xing, Sharan, and Jordan, 
2004), relational learning (Xu et al., 2005), data min- 
ing (Daume III and Marcu, 2005) and vision (Sudderth 
et al., 2005). Despite these successes, the flexibility of 
DP mixture models comes at a high computational 
cost. Standard algorithms based on MCMC, such as 
those described by Neal (1998), are computationally 
expensive and can take a long time to converge to the 
stationary distribution. Variational techniques (Blei 
and Jordan, 2005) are an attractive alternative, but 
are difficult to implement and can remain slow. 

In this paper, we show that standard search algo- 
rithms, such as A*, and beam search, provide an at- 
tractive alternative to these expensive techniques. Our 



algorithms allows one to apply DP mixture models to 
very large data sets. Like variational approaches to 
DP mixture models, we focus on conjugate distribu- 
tions from the exponential family. Unlike MCMC tech- 
niques, which can produce samples of cluster assign- 
ments from the corresponding posterior, our search- 
based techniques will only find an approximate MAP 
cluster assignment. We do not believe this to be a 
strong limitation: in practice, the applications cited 
above all use MCMC techniques to draw a sample and 
then simply choose from this sample the single assign- 
ment with the highest posterior probability. If one 
needs samples from the posterior, then the solution 
found by our methods could initialize MCMC. 

2 DIRICHLET PROCESSES 

The Dirichlet process, introduced by Ferguson (1973), 
is a distribution over distributions. The DP is pa- 
rameterized by a base measure Go and a concentra- 
tion parameter a. We write G | Go, a ~ W{Gq,ol) 
for a draw of a distribution G from the Dirichlet pro- 
cess. We may then draw parameters Q\-n \ G ~ G. 
By marginalizing over G, we find that the draws of 
the parameters 9 obey a Polya urn scheme (Black- 
well and MacQueen, 1973): previously drawn values of 
6 have strictly positive probability of being redrawn, 
thus making the underlying probability measure dis- 
crete (with probability one). 

By using a Dirichlet process at the top of a hierarchical 
model, one obtains a Dirichlet process mixture model 
(Antoniak, 1974). Here, one treats the nth parameter 
9 n as being associated with the nth observation, using 
some likelihood function F. This yields the mixture 
model shown in Eq (1). 
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The clustering property of the DP prefers that fewer 
than N distinct values of 9 will be used. If K < N 
values are used, the be seen to be clustered 

into one of K clusters, determined by the 9 values. 

2.1 PROPERTIES 

DP mixture models posses several properties that are 
useful for further analysis. The most basic is their well- 
known property of exchangeability: samples from the 
process are order independent. This leads to efficient 
MCMC techniques (see Section 2.2). An additional 
property of the DP that will be useful for our analy- 
sis is given as Proposition 3 by Antoniak (1974). He 
gives an explicit form for the probability of individ- 
ual clusterings. In particular, suppose that a sample 
Xi, . . . , xn is drawn from a DP mixture model as in 
Eq (1). At most N distinct values of 9 will be used 
(and typically far fewer). Define a vector mi, . . . , mj 
by: TOj is the number of 9s that appear exactly i times. 
Thus, N = J2i i- m i an d Si m i 1S tne total number of 
clusters. Antoniak (1974) gives an explicit formulation 
for the distribution of counts m: 

P(m\a,N)= N (2) 

where denotes the rising factorial function: 

= 1 and = a(a + 1) • • • {a + n - 1). 

2.2 MCMC TECHNIQUES 

When the mean distribution Go of the DP is conju- 
gate to the likelihood function F, one can analytically 
integrate out G and the 9s from Eq (1) and only main- 
tain a vector of cluster assignments, Ci : jv- The vector 
c serves to specify which x n s were generated from the 
same mixture component 9, so that x n is drawn ac- 
cording to F(9 Cn ). Using the exchangeability of the 
DP, a single Gibbs iteration proceeds as follows (Neal, 
1998, Algorithm 3). For each n, we draw c„ to be a 
new cluster with probability proportional to aH(x n ) 
and draw it equal to an existing cluster d with prob- 
ability N- nt dH{x n I {xi I Ci — d,i ^ n}). Here, a is 
the concentration parameter for the DP, N^ n ,d is the 
number of elements of c (other than n itself) that are 
equal to d. Finally, H (x | S) is the posterior probabil- 
ity of x given that S has been observed, Eq (3). 
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By analogy, we will also write H(x\ jq \ S) to denote 
the marginal posterior probability of N data points. 
In general, when working in the exponential family, 
H will be available in closed form and thus the Gibbs 
sampling is efficient. 

Jain and Neal (2004) propose a Metropolis-Hastings 
sampler based on splitting and merging existing clus- 
ters. This algorithm is shown to mix faster than the 
vanilla Gibbs sampler outlined above. It works by first 
randomly choosing two data points. If the data points 
are currently in different clusters, a proposal is created 
that merges the two clusters. If the data points are 
currently in the same cluster, a proposal is generated 
the splits the cluster. Jain and Neal (2004) present 
three variants on this idea. In the first variant, splits 
are determined completely randomly. In the second 
variant, a small Gibbs sampler is run to determine the 
splits. In the third variant, a Gibbs sampler is run 
to determine the splits and the Metropolis-Hastings 
iterations are interleaved with the standard Gibbs it- 
erations described above. 

3 SEARCH 

MCMC is an attractive technique for inference in DP 
mixture models. However, in many real-world cases, 
one does not actually need a sample of cluster assign- 
ments from the posterior, but actually seeks only a 
single cluster assignment. In these cases, a sample 
is extracted from an MCMC run and a single cluster 
assignment — the MAP assignment — is extracted from 
the sample. This raises the question: is Gibbs sam- 
pling a good search algorithm? We show experimen- 
tally (see Section 4.1) that it is often not. 

In general, it will be intractable to find the exact MAP 
solution, as doing so is NP-hard (by reduction to graph 
partitioning). Here, we describe a set of possible search 
algorithms one can apply to DP mixture models for 
finding the true MAP solution for small problems or 
an approximate MAP solution for large problems. The 
generic search algorithm we use is shown in Figure 3. 
It takes an ordered set of data points x\-n, a scor- 
ing function and a maximum beam size. It maintains 
a max-queue of clusterings of prefixes. In each itera- 
tion, it removes the most promising element c from the 
queue and expands it by a single data point. The two 
elements of variability in the algorithm are the scoring 
function and the maximum beam size. 

The search algorithm is guaranteed to find the maxi- 
mum a posteriori clustering if the beam size is oo and 
the scoring function is admissible. In words, g should 
over-estimate the probability of the best possible clus- 
tering c that agrees with c° on the first N a = |c°| 
elements. In equations, we write c |~ N° to denote the 



function DPScarch 


input: a scoring function g, beam size b, data X\._m 


output: a clustering c 


1 


initialize max-queue: Q <— \Q] 

1 v L \/ J 


2 


while Q is not empty do 


3 


remove state Ci-pjo from the front of Q 


4 


if N° = N then return c 


5 


for all clusters d in c and a new cluster do 


6 


let c° = c © (d) 


7 


compute the score s = g(c°, x) 


8 


update queue: Q <— Enqueue(Q, c°, s) 


9 


end for 


10 


if b < oo and Q > 6 then 


11 


Shrink queue: Q <— Qi : 6 


12 


('drop lowest-scoring elements) 


13 


end if 


14 


end while 



Figure 1: The generic DP search algorithm. 



restriction of c to the first N° elements. Thus: 

g(c°,x) > max p(c,x) (4) 

c:cfA r0 =c° 

We further require that when |c| = \x\, equality holds. 

However, for efficiency, it is useful to use scoring func- 
tions g that occasionally underestimate the true poste- 
rior probability. While these functions no longer guar- 
antee that the exact MAP solution will be found, they 
are often more efficient because g can be tighter, even 
if it is not a strict upper bound (see Section 4.1 for 
supporting evidence). 

The posterior probability p(c, x) can be factored as 
p(c)p(x | c). Here, the probability of a cluster vector 
p(c) is given in Eq (2) (the mapping from the c vector 
to the m vector is straightforward). The probability 
of the data given the clusters is given in Eq (5). 

p(x | c) = H H (x c=d ) (5) 
dec 

Here, we write x c= d as shorthand for {x n : c n = d}. 
Our goal is a function g(c° , x) that upper bounds the 
probability of the best clustering c that completes c°, 
as in Eq (5). For brevity, we write N° to be the length 
of c° and K° to be the number of clusters in c°. 

An upper bound can be obtained by independently 
upper-bounding the two terms, p{c) and p(x | c). In 
fact, we do not upper bound p(c) but rather explicitly 
maximize it. The algorithm for this computation is 
given in Section 3.1. The maximization of p(x | c) 
is more complex and cannot be performed explicitly. 



We give three techniques for this maximization. The 
first, trivial computation, is admissible but very loose 
(Section 3.2). The second is tighter but still admissible 
(Section 3.3). The third is tighter yet, but is no longer 
admissible (Section 3.4). 

Matlab code for solving these problems is available on- 
line at http://hal3.name/DPsearch/. 

3.1 MAXIMIZING THE CLUSTERS 

It is possible to explicitly compute the clustering c, 
beginning with c , that maximizes the posterior clus- 
ter probability given in Eq (2). Consider the case of 
adding a single data point to m. (Recall that nij de- 
notes the number of clusters that contain exactly i 
data points.) If this new data point corresponds to a 
new cluster, then m\ will increase by one. If this data 
point corresponds to an existing cluster that already 
contains £ data points, then mi will decrease by one 
and mi + i will increase by one. This gives a change in 
probability for adding a new data point in Eq (6). 

a £ m t 

— > new : — ► £ : - — (6) 

mi + 1 £ + 1 m £+ i + 1 

By the exchangeability of the underlying process, we 
obtain exchangeability on the m vector. Thus we can 
greedily search for completions of an initial m° by ex- 
ecuting one of the two actions in Eq (6) for the re- 
maining N — N a — 1 data points. In particular, for 
N — N° — 1 steps, we find the value of £ (or "new" ) that 
maximizes Eq (6) and modify the corresponding loca- 
tions in the vector. After all steps, we simply compute 
the probability of the m vector according to Eq (2). 

When N is very large, the loop for computing the opti- 
mal m can be time-consuming. One can greatly accel- 
erate the computation by noticing that once the largest 
cluster gets sufficiently large, it will simply continue 
to grow for the remainder of the iterations. Specif- 
ically, if in any step the largest cluster is increased 
from size £-1 to £, and [£me]/[(£+l)(me + i + l)] dom- 
inates a/ [mi + 1], then one can stop the search process 
and simply further increase £ with all remaining ele- 
ments. Additionally, it is helpful to cache previously 
computed values for repeated use. 

3.2 A TRIVIAL FUNCTION 

Given a clustering c° of the first iV° elements, we can 
trivially upper bound p(x \ c) from Eq (5) by consid- 
ering only the first N° data points. For this admissible 
scoring function, we use Eq (7). 

OTrivial(a; I C °) — n H{ x c«=k) (7) 

kec" 



Using the trivial scoring function is essentially equiv- 
alent to just using a path cost with a zero heuristic 
function in standard A* search. As such, we expect 
this scoring function will lead to an inefficient search. 

3.3 A TIGHTER FUNCTION 

The inefficiency of the trivial scoring function given in 
Section 3.2 is due to the fact that it does not take into 
account any of the unclustered data points. We can 
obtain a tighter scoring function by accounting for the 
probability of the as yet unclustered data points. We 
do this by simplifying the maximization as follows. 



3.4 AN INADMISSIBLE FUNCTION 

The foregoing scoring functions are attractive because 
they provably lead to optimal clusterings. However, 
this optimality comes are a price: the NP-hardness 
of the search problem implies that they will often be 
inefficient. This inefficiency comes from their lack of 
tightness. Here, we give a very simple scoring func- 
tion that is significantly tighter, and therefore leads to 
much faster clusterings. Unfortunately, this heuristic 
is no longer admissible and therefore the search is no 
longer guaranteed to find the globally optimal solution. 
The function we use is given in Eq (12). 



max p(x | c) 
= max Y[ H ( x c=k) (8) 

c\N°=c" kec 
N 

= max j [ H(x n | x Cl:n _ 1=Cm ) (9) 

c\N a =c" n=l 
N° 

= l[H(x n \x c o n _ i=c J (10) 

n=l 

N 

^ II H ( Xn I X Cl:n-l=C m ) 

c\N°=c° n=N° + l 

<5Trivial(a; | C°) (11) 

N 

nmax max H(x n | x Cl . n _ 1=Cm ) 
Kk<K° + l <=: v ln 

n=N° + l c\N"=c" 

c n —k 

The key idea for this scoring function is to treat each 
as-yet unclustered data point independently. In partic- 
ular, for some n > N a , we know that it will either fall 
into one of the K clusters that exist in c°, or it will fall 
into a new cluster. So, for each unclustered point x ni 
we choose a value k for which cluster it falls in to. We 
then must cluster all remaining points x N o +1 . . .x n ^i 
as to only whether they fall into cluster k. 

Despite the fact that this latter maximization is 
simpler, it is still not tractable (effectively because 
H(x | S) is not monotonic in S, even for the expo- 
nential family). The solution we propose is to replace 
each remaining x m (N° < m < n) with a replica of 
x n } In this way, we do obtain monotonicity of H and 
may simply keep adding replicas of x n to S so long as 
this increases the cluster probability. 2 

1 We do not use an actual replica: we use a scaled replica 
with norm equal to the maximum norm of all x. 

2 In certain cases, it is possible to determine the num- 
ber of copies that should be added, analytically. For the 



N 

5lnad(a: | C°) =#Trivial(a? | C°) ]J H(x n ) (12) 

n=N° + l 

This scoring function uses the true probability of the 
existing clusters and then assigns each new data point 
to a new cluster. This is inadmissible because, for 
example, if the last two data points are identical, it 
would be preferable to cluster them together. 

When using this scoring function, the order in which 
the data points is presented becomes important. We 
have found that a useful heuristic is to order the data 
points by increasing marginal likelihood. This is likely 
because examples with high marginal likelihood are 
more likely to be in their own clusters anyway, and 
hence the heuristic is better. (See Section 4.3 for some 
experiments comparing ordering strategies.) 

4 EXPERIMENTAL RESULTS 

We present results on three problems, one based on 
artificial data, one based on the MNIST images data 
set and one based on the NIPS papers data set. All 
experiments were run on a 3.8GHz Intel Pentium 4 
machine with 4Gb of RAM. 

4.1 ARTIFICIAL DATA 

Our first set of experiments are on artificial data 
to demonstrate the scaling properties of the search 
methods and to compare them directly to Gibbs 
sampling. For these problems, we generate a data 
set according to a Gaussian/Gaussian DP mixture 
model with prior mean zero and prior variance 10. 
We generate data sets of increasing size (N G 

Dirichlet/Multinomial pair, all should be added. For the 
Gaussian/Gaussian pair, one should add sufficiently many 
copies of Px n (where /3 > 1 is the scaling factor) so as to 
fully move the posterior mean to lie at x n and then add 
copies of x n without the scaling factor. 
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Figure 2: Results on artificial data set; x-axis is always the size of the data set. (UL) y-axis is the ration of the 
data negative log likelihood to the data negative log likelihood found by exhaustive search (lower is better); (BL) 
y-axis is f-score (higher is better); (UR) y-axis is computation time in seconds (lower is better); (BR) y-axis is 
number of elements enqueued during search (lower is better) . 



{4, 6, 8, 10, 11, 12, 13, 14, 15, 20, 25, 30, 50}). We first 
run Gibbs sampling and split-merge on each data set. 
We then run six search algorithms. The first three 
search algorithms run full search using the three scor- 
ing functions described above. The last three search 
algorithms use the three scoring functions described 
above, but with a maximum queue size of 10. For 
each data set size, we generate 10 data sets and aver- 
age results across these. 

Comparison between a sampling approach and a non- 
sampling approach is difficult. We perform our com- 
parison to be as unfair to our proposed approach as 
possible: i.e., we try to make sampling look as good as 
possible. We run both samplers as follows. We do fif- 
teen runs of the sampler for 1000 iterations each. The 
first five are initialized with a single cluster; the second 
five are initialized with N clusters; the last five are ini- 
tialized randomly with log TV clusters. For each run, 



there will be a sample that achieves the highest log 
likelihood. We choose this iteration as the stopping 
point. The best of the best log likelihoods over the 
15 runs is then reported as the final score. The time 
reported is the time it took to get to that log likeli- 
hood in the single run. That is, the numbers reported 
are overly optimistic in terms of time, and in line 
with practice in terms of performance. For the split- 
merge algorithm (Jain and Neal, 2004), we only use the 
third variant that performs intermediate Gibbs sam- 
pling. The other split-merge algorithms fared surpris- 
ingly poorly — often losing significantly even to stan- 
dard Gibbs in terms of log likelihood. 

The results of these experiments are shown in Figure 2. 
The upper-left graph shows a plot of the data nega- 
tive log likelihood as a function of data set size (lower 
is better). The negative log likelihoods are presented 
as a ratio to the log likelihood of the true MAP as- 



signmcnt. Interestingly, both inadmissible search algo- 
rithms achieve optimal performance. The beam-based 
admissible heuristics do worse, and the Gibbs sam- 
pler never reaches optimal performance (except on the 
smallest data set). The split-merge performance is 
quite variable: sometimes better than Gibbs, some- 
times worse. 

In the upper-right graph, we plot computation time 
as a function of data set size. Split-merge is almost 
always slightly slower than Gibbs. As can be eas- 
ily seen, the time taken for both admissible heuristics 
grows quite fast and, at least for the trivial heuristic, 
becomes intractable after only 10-15 data points. The 
Trivial-Beam remains reasonably efficient (on par with 
Gibbs), but also had the worst negative log likelihood 
performance. The Admissible-Beam is actually rea- 
sonably slow because the computation of the heuris- 
tic is time-consuming. As expected, the inadmissible 
heuristic (with or without beam) is always the fastest. 

The bottom-left graph shows the f-scorc (geometric 
mean of precision and recall) of pair-wise "same clus- 
ter or not" decisions for the clustering found by the 
different algorithms against the ground truth (higher 
is better) . As we can see, the inadmissible heuristic al- 
ways performs best, while the others vary significantly 
in performance (Gibbs, for some reason, does quite 
poorly initially but then improves as the size of the 
data set increases). 

Finally, the bottom-right graph shows (for only the 
search-based algorithms) the number of states en- 
queued during search. This corresponds roughly with 
the upper-right graph (time) but excludes considera- 
tions such as the time to compute the heuristic. As ex- 
pected, the inadmissible heuristic enqueues by far the 
fewest entries (in fact, for most of these algorithms, 
the number of elements dequeued by the inadmissible 
heuristic was between N and N + 5 for all algorithms, 
meaning that pure greedy search may be reasonable). 

4.2 HANDWRITTEN DATA 

For this experiment, we use the handwritten data 
set from MNIST (specifically, the version assembled 
by Sam Roweis) consisting of images of numbers (in 
28 x 28 = 784 dimensions). Following Kurihara, 
Welling, and Vlassis (2006), we preprocess the data 
by centering and sphcrizing it, then running PCA to 
obtain a 50-dimcnsional representation. This data set 
consists of 60,000 images, and thus only the inadmis- 
sible heuristic is sufficiently efficient to run (we use a 
beam of 100, though this turns out to be unnecessar- 
ily large). We run on three versions of the data: a 5% 
subset, a 20% subset and the full set. In all cases, we 
use a = 1 and a prior variance of 0.1. 
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Figure 3: Mean images for clusters found on the full 
(60fc examples) MNIST data set. 



At the 5% level, search completes in just over 11 sec- 
onds, roughly 270 data points per second. The algo- 
rithm finds a solution using 11 clusters and achieves 
a negative log likelihood of 2.04e5. Running Gibbs 
on this data set takes roughly 40 seconds per itera- 
tion, and after 100 iterations achieves a best perfor- 
mance of 2.09e5. Split-merge obtains a log-likelihood 
of 2.05e5, but takes almost two hours to do so. At 
the 20% level, search completes in just over 105 sec- 
onds (115 elements per second) with a final negative 
log likelihood of 8.02e5 (and 21 clusters). Gibbs sam- 
pling takes roughly 18 minutes per iteration and find 
a best solution of 8.34e5 (after 100 iterations). Split- 
merge obtains a solution with negative log likelihood 
of 8.15e5 after eight hours. Finally, for the full data 
set, search completes in just under 15 minutes (roughly 
66 data points per second) with a score of 3.96e6 (and 
27 clusters). Gibbs takes an egregious 7 hours per it- 
eration on this data, so we were only able to run 15 
iterations, leading to a best score of 4.2e6. Split-merge 
was also to slow to run for more than 15 iterations, 
achieving a score of 4.1e6 after about 7 days. 

In Figure 3, we show the mean image from each clus- 
ter found by the search algorithm on the full data 
set (sorted by cluster size). Qualitatively, these clus- 
ters appear quite reasonable. (Kurihara, Welling, 
and Vlassis (2006) present a similar figure for vari- 
ational techniques; however, their model uses a full 
Gaussian/inverse- Wishart prior while ours only uses a 
Gaussian and we use a fixed prior variance. One could 
easily adapt our algorithms to use the full prior, but 
we do not do so in the experiments reported here.) 

4.3 NIPS DOCUMENTS 

Finally, to demonstrate the applicability of our algo- 
rithm to discrete data, we apply the same model to 
the set of papers from NIPS 1-12 (assembled by Sam 
Roweis). This data set consists of 1740 documents over 



training hidden units error learning weight generalization net- 
work weights regression layer algorithm recurrent gradient 
nodes prediction theorem convergence node student 
spike neuron neurons cells synaptic firing cell cortical cortex 
activity synapses stimulus excitatory orientation inhibitory 
membrane spikes ocular dominance fig 

mixture speech em likelihood image tangent hmm cluster- 
ing word images pea recognition posterior bayesian speaker 
gaussian experts kernel cluster classification 
chip circuit analog motion image voltage vlsi visual auditory 
images velocity sound retina intensity optical pulse disparity 
template pixel silicon 

policy reinforcement state controller action control actions 
robot reward agent mdp learning states sutton policies tra- 
jectory planning singh barto rl 

object objects units views face image hidden unit visual im- 
ages recognition attractor activation faces features layer at- 
tention feature representations module 
bounds threshold polynomial bound theorem depth boolean 
proof gates lemma dimension tree node concept clause class 
functions winnow boosting maass 

motion head motor visual eeg eye direction subjects stimu- 
lus ica movements cells movement cortex velocity cue field 
parietal vor spatial 

word classifiers classifier hmm rbf character recognition train- 
ing speech mlp characters hybrid user context net hidden 
layer trained words error 

belief evidence similarity posterior bayesian retrieval user 
propagation hypotheses query concept approximation exam- 
ples exact images iii image mackay database nodes 

Table 1: Most indicative words for each of the ten 
largest clusters (out of fourteen) of documents from 
NIPS papers. Sorted by cluster size. 

a vocabulary of roughly 13/c words. In our model, we 
drop the top ten words from the vocabulary and retain 
only the top 1000 from the remainder. For this data, 
we use the Dirichlet/Multinomial DP mixture model 
with a = 1 and a symmetric Dirichlet prior with pa- 
rameter equal to 10. 

Running DP search (with the inadmissible heuristic) 
on this data set takes roughly 21 seconds (83 docu- 
ments per second) and results in fourteen clusters. For 
each cluster, we extract the top twenty words from the 
documents assigned to that cluster ("top" as deter- 
mined by tf-idf score). These are depicted in Table 1 
(sorted by cluster size). 

We also experiment with alternative ordcrings of the 
data set for this problem. When the data is presented 
in ascending order of marginal likelihood, the result- 
ing log likelihood is 2.4407e6. When this order is re- 
versed, the resulting log likelihood is 2.4740e6. Finally, 
we consider presentations in random order. Over ten 
such orders, the mean log likelihood is 2.4489e6 and 
the variance is 0.0067e6. (Of all the random passes, 
only one achieves a higher log likelihood than the de- 
fault ascending order, and does so with a small gain: 
2.4403e6.) This suggests that the ascending order is a 



reasonable heuristic. 

In comparison the vanilla Gibbs sampling and the 
split-merge proposals, the search algorithm again per- 
forms significantly better. The log likelihood for the 
best Gibbs clustering on this data was 3.2e6 and for 
the split-merge proposals it as 3.0e6, both taking a 
little over an hour. 

5 PRIOR WORK 

Although in this paper we have only compared our 
search-based technique to straightforward Gibbs sam- 
pling, there are other inference techniques for DP mix- 
ture models. Still within the context of MCMC, Xing, 
Sharan, and Jordan (2004) propose a Metropolis- 
Hastings sampler that is shown to mix faster than 
the Gibbs sampler and is only moderately more chal- 
lenging to implement. An alternative, recent proposal 
for inference in DP mixture models is to make use of 
particle filters (sequential MCMC) (Fearnhead, 2004). 
Particle filters look somewhat like a stochastic beam 
search algorithm and, as such, are similar in spirit to 
the approach proposed here. 

We are additionally aware of two deterministic ap- 
proaches to inference in DP mixture models based on 
variational techniques. The first, due to Blei and Jor- 
dan (2005), employs the stick-breaking construction 
for the Dirichlet process (Sethuraman, 1994) to con- 
struct a finite variational distribution for the infinite 
mixture model. On artificial data, they report on the 
order of a 100 — 300 decrease in time (over Gibbs sam- 
pling) with essentially identical held- out data log like- 
lihoods. Very recently, Kurihara, Welling, and Vlas- 
sis (2006) present even more efficient variational al- 
gorithm for DP mixture models. In contrast to the 
method of Blei and Jordan (2005), the new algo- 
rithm employs an infinite variational distribution and 
a collapsed distribution (Kurihara, Welling, and Teh, 
2007). It can be further accelerated (at least in the 
Gaussian case) by using kd-trees for caching sufficient 
statistics of the data set. 

6 CONCLUSIONS 

We have presented an algorithm for finding the MAP 
clustering for data under a Dirichlet Process mixture 
model, a task that appears regularly in many practical 
applications. It has been shown to be extremely effi- 
cient (clustering a data set of 60fc elements in under 
15 minutes in Matlab) and general (we have shown ap- 
plications both to continuous and discrete data sets). 
Moreover, for small data sets, we have shown relatively 
efficient schemes for finding provably optimal solutions 
to the MAP problem. Our results, especially with the 



inadmissible scoring function, show that one can ob- 
tain a very good approximate MAP solution incredibly 
quickly. Profiling shows that our main bottleneck is ac- 
tually optimizing p(c) from Section 3.1, not p(x | c). 
In the worst case, this optimization is quadratic in the 
size of the data set, not linear. We are currently in- 
vestigating ways of making this more efficient. 

In comparison to variational approaches to DP mix- 
ture models (Blci and Jordan, 2005; Kurihara, 
Welling, and Vlassis, 2006), our algorithm is applica- 
ble to exactly the same data (exponential family with 
conjugate priors) and suffers from the same drawback: 
one cannot easily re-estimate the concentration param- 
eter. However, given the speed of our algorithm, one 
could easily use multiple runs with Bayesian model se- 
lection to find a suitable value. It should be noted that 
the results presented in papers discussing variational 
approaches compare to Gibbs sampling in terms of log 
likelihood and speed. The general result is that the 
variational approaches obtain similar log likelihoods 
about 100 — 300 times faster. Our results with the 
inadmissible function show that we can often achieve 
better results to Gibbs. It is difficult to imagine an 
algorithm that is more computationally efficient than 
search with our inadmissible function. 

The primary advantage of MCMC techniques over our 
method is that they produce a true representation 
of the posterior, provided that they are run for long 
enough 3 However, if a true sample of the posterior is 
desired, it would be natural to run our algorithm as 
an initializer for any MCMC algorithm. This would 
yield the benefits of a sample from the posterior with- 
out requiring that the sampler first find a region of 
high posterior probability. Additionally, MCMC tech- 
niques can be applied to non-conjugate distributions 
(at least in theory) by using an embedded sampling 
procedure to estimate the intractable integrals. One 
could, in principle, use the same methodology within 
our search framework, though this would obviate many 
of the speed benefits. An alternative would be to use 
an efficient deterministic approximation. 
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